- Complex data structures (matrices, lists and data frames)
- Functions in R
- Reading data from files
- vectors (string, number, integer, logic, factor)
- matrices and arrays
- lists
- data frames
2023-03-03
Questions, anybody?
samples <- c(1, 10, 23, 42, 13)
samples_n <- length(samples)
Both c and length are functions. They take some arguments (often many of them) and return a single object: a vector, a matrix or something else.
You can always assign the result of the function to a variable.
Sometimes functions return NULL, which is R for “nothing”, but which is still something you can use or assign.
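As a small illustration of the points above (a sketch, not part of the original slides; the variable name `nothing` is made up for this example), the result of a function call can be stored in a variable, and this also works when the result is NULL:

```r
# c() and length() both return an object that can be stored in a variable
samples <- c(1, 10, 23, 42, 13)
samples_n <- length(samples)     # 5

# c() called without arguments returns NULL
nothing <- c()
is.null(nothing)                 # TRUE
length(nothing)                  # 0

# NULL can be used like any other value, e.g. combined with a vector
combined <- c(nothing, samples)  # same as samples
```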
There is a lake in a garden. Every day, the water lilies cover twice as much area as the previous day. On the first day, the water lilies cover 1/100th of the area of the lake.
- Create a vector with the day numbers (use c() or :). Now assign the fraction of the area covered by water lilies on day n to variable y
- Plot the result (use the plot() function)
- Use abline(h=.5) (what does it do?) to show a graphical solution
- What does y[3] do?
- What does y[4:5] do? (just try it!)
- Pass the col parameter to plot and a color name (e.g. “red”) to change the color of the line (plot(..., col="red"))
- Use the lines(x, y) function to put a second line on the plot
In R, there are two special values: TRUE and FALSE. They can be used to create logical vectors.
sel <- c(TRUE, TRUE, TRUE, TRUE, FALSE)
sel
!sel
Comparison operators (>, <, <=, >=, ==, !=) produce logical vectors:
samples <- c(1, 1, 2, 5, 7)
samples > 2
which(samples == 7)
which(samples != 1)
Logical vectors can be used to access elements:
persons <- c("Aphrodite", "Bacchus", "Circe", "Demeter", "Euripides")
sel <- c(TRUE, TRUE, TRUE, TRUE, FALSE)
persons[sel]
# we can abbreviate the TRUE and FALSE to T and F (avoid)
greek <- persons[ c(T, F, T, T, T) ]
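Comparison operators and logical indexing are often combined in a single step. The following sketch (an illustration added here, reusing the `samples` vector from the comparison example) shows how a comparison selects elements directly:

```r
samples <- c(1, 1, 2, 5, 7)

# the comparison produces a logical vector ...
sel <- samples > 2            # FALSE FALSE FALSE TRUE TRUE

# ... which can be used to select elements
large <- samples[sel]         # 5 7
not_one <- samples[samples != 1]  # 2 5 7
small <- samples[!sel]        # 1 1 2

# which() gives the positions, sum() counts the TRUE values
pos <- which(sel)             # 4 5
n_large <- sum(sel)           # 2
```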
Create a vector as follows:
samples <- c(1, 10, NA, 15)
NA stands for not available (e.g., missing data)
- What does length(samples) return?
- What does mean(samples) return? Why is that?
- Use the parameter na.rm=TRUE for the mean() function. Look up help (?mean) to see how it can be used. What happens now?
- What does the is.na() function return when applied to samples?
- How can you find the positions of the NA values? Use is.na and which
- … NA?
We show how to create a function for two reasons: it is simple, and it will help you understand what functions are.
#' Function name
#' Function description
some_name <- function(param1, param2=2) {
## code comment
# <your code goes in here>
}
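To make the template concrete, here is a minimal sketch of a complete function. The name `mean_no_na` and its `digits` parameter are invented for this illustration; the point is only to show a parameter with a default value and that a function returns the value of its last expression.

```r
#' mean_no_na (hypothetical example function)
#' Calculates the mean of a vector after removing missing values
mean_no_na <- function(x, digits=2) {
  ## remove the NA values first
  x <- x[!is.na(x)]
  ## the value of the last expression is returned
  round(mean(x), digits)
}

res_default <- mean_no_na(c(1, 10, NA, 15))            # 8.67
res_rounded <- mean_no_na(c(1, 10, NA, 15), digits=0)  # 9
```

If `digits` is not given, the default value 2 is used.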
Much like vectors, matrices can only hold one data type (e.g. only numeric, only character or only logical).
m <- matrix(1:18, ncol=3, nrow=6)
# compare with
m <- matrix(1:18, ncol=3, nrow=6, byrow=TRUE)
dim(m)
ncol(m)
nrow(m)
matrix[row, column]
So, for example:
m[1, ]  # vector which is the first row
m[, 2]  # vector which is the second column
m[3, 1] # first element of the third row
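The effect of byrow is easiest to see by comparing the same positions in both versions of the matrix. In this sketch, the names `m1` and `m2` are chosen here for illustration only:

```r
m1 <- matrix(1:18, ncol=3, nrow=6)              # filled column by column
m2 <- matrix(1:18, ncol=3, nrow=6, byrow=TRUE)  # filled row by row

m1[1, ]   # 1 7 13
m2[1, ]   # 1 2 3

m1[, 2]   # 7 8 9 10 11 12
m1[3, 1]  # 3
m2[3, 1]  # 7

dim(m1)   # 6 3 (rows, columns); identical for m2
```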
Elements of a vector can be accessed not only using numbers (indices) or logical vectors. You can assign names to vectors:
person <- c("January", "Weiner", 134)
names(person) <- c("FirstName", "LastName", "Age")
person["FirstName"]
person["Age"]
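One detail worth checking in this example: a vector holds only one data type, so the number 134 is silently converted to a string. The following sketch (added for illustration) shows the consequence and how to convert the value back:

```r
person <- c("January", "Weiner", 134)
names(person) <- c("FirstName", "LastName", "Age")

# all elements are now strings
cls <- class(person)                # "character"
age_chr <- unname(person["Age"])    # "134", a string

# convert back to a number before calculating
age_next <- as.numeric(person["Age"]) + 1   # 135

# several names can be used at once
full_name <- person[c("FirstName", "LastName")]
```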
We can name rows and columns of a matrix and use the names to access the rows and columns:
colnames(m) <- letters[1:ncol(m)]
rownames(m) <- LETTERS[1:nrow(m)]
m["A", "b"] # one "cell"
m["B", ]    # one row
m[ , "b"]   # one column
Create a list with the list() function:
person <- list(name="Weiner",
               Age=NA,
               given="January")
To access an element of a list, you need to use double brackets [[
person[["name"]]
There is a shortcut:
person$name
If you use single brackets [, you will get a piece of the “clothesline”, that is, you will produce a smaller list.
person["name"]
class(person)
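The difference between double and single brackets can be checked with class(). This is a short sketch added for illustration, using the same list as above:

```r
person <- list(name="Weiner", Age=NA, given="January")

# double brackets return the element itself
cls_element <- class(person[["name"]])   # "character"

# single brackets return a (smaller) list
cls_piece <- class(person["name"])       # "list"

# single brackets can select several elements at once
two <- person[c("name", "given")]
length(two)                              # 2

# $ is a shortcut for [[ with a name
same <- identical(person$name, person[["name"]])  # TRUE
```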
Caveats:
- To access an element, use [[, not [
- List elements can have names (names()), but don’t have to
Data frames are a bit like matrices, but every column can store a different type of data. In this, they are more like lists (which they in fact are).
names <- c("January", "Manuela", "Bill")
lastn <- c("Weiner", "Benary", "Gates")
age <- c(1001, NA, 65)
d <- data.frame(names=names, last_names=lastn, age=age)
class(d)
class(d[,1])
class(d[,3])
You can access the data frame elements much like the elements of a matrix.
However, since data frames are lists, the list operator ($) also works:
d$names # same as d[,1] or d[, "names"]
d$last_names
d$last_names[1]
Note that when you select a row, you will get a data frame, not a vector. This is because each of the columns can be of a different type, and vectors can hold only one type of data.
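The behaviour of row and column selection can be verified directly. This sketch (added for illustration) rebuilds the data frame from above; stringsAsFactors=FALSE is added so that the result is the same in older R versions, and the name `row2` is chosen here:

```r
d <- data.frame(names=c("January", "Manuela", "Bill"),
                last_names=c("Weiner", "Benary", "Gates"),
                age=c(1001, NA, 65),
                stringsAsFactors=FALSE)

# selecting a row gives a data frame with one row
row2 <- d[2, ]
class(row2)            # "data.frame"

# selecting a column gives a vector
class(d[, 3])          # "numeric"

# a single cell, by row number and column name
cell <- d[2, "last_names"]   # "Benary"

# logical vectors work for rows as well: rows with a known age
known <- d[!is.na(d$age), ]
nrow(known)            # 2
```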
Caveats:
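- In R versions before 4.0, data.frame() converts strings to factors by default; to prevent this, use stringsAsFactors=FALSE
Gory details: matrices are a basic data type. Data frames are lists.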
Caveats:
- read_* functions return a tibble
- … matrix and rnorm.
- … as.data.frame for that.
- … rep function for that.
- … seq function for that.
rep() is used to replicate vectors.
rep(c("A", "B"), 5)
# result:
# [1] "A" "B" "A" "B" "A" "B" "A" "B" "A" "B"
rep(c("A", "B"), each=5)
# result:
# [1] "A" "A" "A" "A" "A" "B" "B" "B" "B" "B"
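The seq function generates regular sequences and combines well with rep. The calls below are a small sketch added for illustration; the variable names are made up:

```r
# sequence with a fixed step
s_by <- seq(1, 10, by=3)             # 1 4 7 10

# sequence with a fixed number of elements
s_len <- seq(0, 1, length.out=5)     # 0.00 0.25 0.50 0.75 1.00

# rep with times= repeats the whole vector
r_times <- rep(1:3, times=2)         # 1 2 3 1 2 3

# rep and seq combined
r_each <- rep(seq(2, 6, by=2), each=2)   # 2 2 4 4 6 6
```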
Main file formats you will encounter:
| File format | Function | Package | Notes |
|---|---|---|---|
| Columns separated by spaces | read_table() | readr | One or more spaces separate each column. |
| TSV / TAB separated values | read_tsv() | readr | Delimiter is tab (\t). |
| CSV / comma separated | read_csv() | readr | Comma separated values. |
| Any delimiter | read_delim() | readr | Customizable. |
| XLS (old Excel) | read_xls(), read_excel() | readxl | Just don’t use it. |
| XLSX (new Excel) | read_xlsx(), read_excel() | readxl | Use the sheet argument to choose the sheet to read (the first sheet is read by default). Note: returns a tibble, not a data frame! |
Note: there are also “base R” functions read.table, read.csv, read.delim (there is no function for reading XLS[X] files in base R). The tidyverse functions above are preferable.
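The following sketch compares a readr function with its base R counterpart. It assumes that the readr package (version 2.0 or later) is installed; the file is a temporary file created only for this illustration, and its contents are made up.

```r
library(readr)

# write a small CSV file to a temporary location
tmp <- tempfile(fileext=".csv")
writeLines(c("id,name,value",
             "1,a,0.5",
             "2,b,1.5"), tmp)

# tidyverse version: returns a tibble
d_readr <- read_csv(tmp, show_col_types=FALSE)
class(d_readr)

# base R version: returns a plain data frame
d_base <- read.csv(tmp)
class(d_base)     # "data.frame"

dim(d_base)       # 2 3
```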
Read and inspect the following files:
- TB_ORD_Gambia_Sutherland_biochemicals.csv
- iris.csv
- meta_data_botched.xlsx
The function readxl_example("deaths.xls") returns a file name. Read this file. How can you omit the lines at the top and at the bottom of the file? (hint: ?read_excel). How can you force the date columns to be interpreted as dates and not numbers?
Tibbles belong to the tidyverse. They are nice to work with and very useful, but we can stick to data frames for now. Therefore, convert the result to a data frame:
mydataframe <- as.data.frame(read_xlsx("file.xlsx"))
One crucial difference between a tibble and a data frame is that tibble[ , 1 ] returns a tibble, while dataframe[ , 1] returns a vector. The second crucial difference is that a tibble does not support row names (on purpose!).
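The first difference can be checked with a few lines of code. This sketch assumes that the tibble package is installed; the example data are made up for the illustration.

```r
library(tibble)

df <- data.frame(a=1:3, b=c("x", "y", "z"))
tb <- as_tibble(df)

# data frame: selecting one column gives a vector
class(df[, 1])                 # "integer"

# tibble: selecting one column gives a tibble
class(tb[, 1])                 # "tbl_df" "tbl" "data.frame"

# to get a vector from a tibble, use [[ or $
v1 <- tb[[1]]
v2 <- tb$a

# to keep a data frame when selecting one column, use drop=FALSE
class(df[, 1, drop=FALSE])     # "data.frame"
```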